Summary. Let $\#A$ denote the cardinality of a finite set $A$. Let $t(x) = x$ if $x > 1$ and
$1$ otherwise. For any two sets $A$, $B$ let $\delta(A, B) = \log_2\left(t\left(\#(B \cap \bar{A})\,\#A\right)\right)$.
We define a new set distance $d(A, B) = \max\{\delta(A, B), \delta(B, A)\}$ motivated by
combinatorial notions of entropy and information. We prove that $d$ is a semi-metric
on the space of sets of size at least 2. The triangle inequality holds for triplets $A$,
$B$, $C$ that are not strictly contained one in another.

1 Introduction

A basic problem in pattern recognition [6] is to find a numerical value that
represents the dissimilarity or 'distance' between any two input patterns of the domain,
for instance between two binary sequences that represent document files or
between genetic sequences of two living organisms. There are many distances defined
in different fields of mathematics, engineering and computer and information sci- 
ences [5]. A good distance is one which picks out only the 'true' dissimilarities and 
ignores those that arise from irrelevant attributes or due to noise. In most
applications the design of a good distance requires inside information about the domain.
For instance, in the field of information retrieval [4] the distance between two
documents is weighted largely by words that appear less frequently, since the words which
appear more frequently are less informative. The ubiquitous Levenshtein distance
[9] measures the distance between two sequences (strings) as the minimal number of 
edits (insertion, deletion or substitution of a single character) needed to transform 
one string into another. Approximate string matching [10] is an area that uses such
edit-distances to find matches for short strings inside long texts. Typically, different 
domains require the design of different distance functions which take such specific 
prior knowledge into account. It can therefore be an expensive process to acquire 
expertise in order to formulate a good distance. The paper of [20] introduced a
notion of complexity of a finite binary string which does not require any prior knowledge
about the domain or context represented by the string (this is sometimes referred to 
as the universal property). This complexity (called the production complexity of a 



2 Joel Ratsaby 



string) is defined as the minimal number of copy-operations needed to produce the
string from a short starting string called the base. This definition of complexity is
related to the Levenshtein distance mentioned above. It is proportional to the number
of distinct phrases and the rate of their occurrence along the sequence. There has
been some work on using the LZ-complexity to define a sequence-distance measure
in bioinformatics [17]. Other applications of the LZ-complexity include: approximate
matching of strings [16], analysis of complexity of biomedical signals [2], recognition
of structural regularities [11], characterization of DNA sequences [7] and responses
of neurons to different stimuli [3], study of brain function [18] and brain information
transmission [T5] and EEG complexity in patients Q]. 
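The edit distance described above can be illustrated with a minimal dynamic-programming sketch. The function name `levenshtein` and the example strings are chosen here for illustration only and are not taken from the cited works.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimal number of single-character insertions, deletions or
    substitutions needed to transform s into t."""
    # prev[j] is the distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Note that this distance depends on the order of characters, whereas the set distance introduced in this paper depends only on which substrings occur.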

In the current paper we introduce a distance function between two strings which
also possesses this universal property. Our approach is to consider a binary string as
a set of substrings [13]. To represent the complexity of such a set we use the notion
of combinatorial entropy [12] and introduce a new set distance function. We proceed
to describe some fundamental concepts concerning entropy and information of sets.



2 Entropy and information of a set 

Kolmogorov [H] investigated a non-stochastic measure of information for an object 
y. Here y is taken to be any element in a finite space Y of objects. He defines the 
'entropy' of Y as H(Y) = log #Y where #Y denotes the cardinality of Y and all 
logarithms henceforth are taken with respect to 2. 

As he writes, if it is known that $Y = \{y\}$ then this provides $\log \#Y$ bits of
'information' or in his words "this much entropy is eliminated". To represent partial
information about $Y$ based on another information source $X$ let $R = X \times Y$ be a
general finite domain and consider a set

$$A \subseteq R \tag{1}$$

that consists of all permissible pairs $(x, y) \in R$ (in the usual probabilistic-based
representation of information this is analogous to having a uniform prior probability
distribution over a certain region of the domain). The entropy of $Y$ is defined as

$$H(Y) = \log \#\Pi_Y(A)$$

where $\Pi_Y(A) = \{y \in Y : (x, y) \in A \text{ for some } x \in X\}$ denotes the projection of $A$ on
$Y$. Consider the restriction of $A$ on $Y$ based on $x$ which is defined as

$$Y_x = \{y \in Y : (x, y) \in A\}, \quad x \in \Pi_X(A) \tag{2}$$

then the conditional combinatorial entropy of $Y$ given $x$ is defined as

$$H(Y|x) = \log \#Y_x. \tag{3}$$

Kolmogorov defines the information conveyed by $x$ about $Y$ by the quantity

$$I(x : Y) = H(Y) - H(Y|x). \tag{4}$$
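The quantities in (2) to (4) can be computed directly from a finite set of permissible pairs. The following is a minimal sketch; the toy set `A`, its labels and the function names are hypothetical and serve only to illustrate the definitions.

```python
from math import log2

# Hypothetical toy set A of permissible pairs (x, y); the labels are made up.
A = {("x1", "a"), ("x1", "b"),
     ("x2", "a"), ("x2", "b"), ("x2", "c"), ("x2", "d")}

def projection_Y(A):
    """Pi_Y(A): all y that appear in some permissible pair."""
    return {y for (_, y) in A}

def restriction(A, x):
    """Y_x: all y such that (x, y) is permissible."""
    return {y for (x2, y) in A if x2 == x}

def entropy(A):
    """H(Y) = log2 #Pi_Y(A)."""
    return log2(len(projection_Y(A)))

def conditional_entropy(A, x):
    """H(Y|x) = log2 #Y_x."""
    return log2(len(restriction(A, x)))

def information(A, x):
    """I(x : Y) = H(Y) - H(Y|x)."""
    return entropy(A) - conditional_entropy(A, x)

print(entropy(A))                    # 2.0
print(conditional_entropy(A, "x1"))  # 1.0
print(information(A, "x1"))          # 1.0
print(information(A, "x2"))          # 0.0
```

In this toy example knowing `x1` halves the set of candidate objects and so conveys one bit, while `x2` leaves all four candidates possible and conveys none.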

In [15] an alternative view of $I(x : Y)$ is defined as the information that a set $Y_x$
conveys about another set $Y$ satisfying $Y_x \subseteq Y$. Here the domain $R$ is defined based
on the previous set $A$ as $R = \Pi_Y(A) \times \Pi_Y(A)$ which consists of all permissible pairs
$(y, y')$ of objects. Knowledge of $x \in X$ means knowing the set $A_x \subseteq R$, $A_x = \{(y, y') :
y \in \Pi_Y(A), y' \in Y_x\}$. The information between $Y_x$ and $Y$ is then defined as

$$I(Y_x : Y) = \log\left(\#\Pi_Y(A)\right)^2 - \log \#A_x = \log\left(\#\Pi_Y(A)\right)^2 - \log\left(\#\Pi_Y(A)\,\#Y_x\right). \tag{5}$$

Clearly, $I(Y_x : Y) = I(x : Y)$. Note that $I(Y_x : Y)$ measures the difference in
description length of any pair of objects $(y, y') \in \Pi_Y(A) \times \Pi_Y(A)$ when no 'labeling'
information exists versus that when there exists information which labels one of
them as being an element of $Y_x$. Thus the second term in (5) can be viewed as the
conditional combinatorial entropy of $\Pi_Y(A)$ given the set $Y_x$. In [12], [15], [13] this is
used to extend Kolmogorov's combinatorial information to a more general setting
where knowledge of $x$ still leaves some vagueness about the possible value of $y$.
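The equality between the set-based information and Kolmogorov's $I(x : Y)$ can be checked in a few lines. The derivation below is a worked step, using only the fact that the set $A_x$ has $\#\Pi_Y(A)\,\#Y_x$ elements:

```latex
\begin{align*}
I(Y_x : Y) &= \log\bigl(\#\Pi_Y(A)\bigr)^2 - \log\bigl(\#\Pi_Y(A)\,\#Y_x\bigr) \\
           &= 2\log \#\Pi_Y(A) - \log \#\Pi_Y(A) - \log \#Y_x \\
           &= \log \#\Pi_Y(A) - \log \#Y_x \\
           &= H(Y) - H(Y|x) = I(x : Y).
\end{align*}
```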

While the distance that we introduce in this paper is general enough for any
objects, our interest is to introduce a combinatorial distance for binary strings. We
henceforth refer to $X = \{0, 1\}^*$ as the space of binary strings $x$. Each string $x \in X$
is a description of a corresponding set $Y_x$ in the space $Y$ of objects $y$. Our approach
to defining a distance between two binary strings $x$ and $x'$ is to relate them to sets
of objects and then measure the distance between the two corresponding sets. Let
us denote by $M : X \to 2^{Y}$ the function which defines how a string $x$ yields a set
$Y_x \subseteq Y$. In general, $M$ may be a many-to-one function since there may be several
strings (viewed as descriptions of the set) of different lengths for a given set. In the
context of the above, we now consider a permissible pair $(x, y) \in A$ to be one which
consists of an object $y$ that is contained in a set $Y_x$ which is described by $x$. Clearly,
not every possible pair $(x, y)$ is permissible, as for instance, if $y' \notin Y_x$ then $(x, y')$ is
not permissible.

In the next section we introduce a combinatorial information distance. We start 
with a distance for general sets and then apply it as a distance between binary 
strings. 



3 The distance 

Let $\Omega$ be a domain. For a finite set $A \subseteq \Omega$ denote by $\#A$ the cardinality of $A$. The
cardinality of the empty set is zero. Define the following function:

$$t(x) = \begin{cases} x & \text{if } x > 1 \\ 1 & \text{otherwise.} \end{cases}$$

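Definition 1. For any two finite sets $A, B \subseteq \Omega$ define the following function $\delta :
2^{\Omega} \times 2^{\Omega} \to \mathbb{R}_{\ge 0}$ which maps a pair of finite sets into the non-negative reals:

$$\delta(A, B) = \log\left(t\left(\#(B \cap \bar{A})\,\#A\right)\right)$$

where $\bar{A}$ denotes the complement of the set $A$. It is simple to realize that $\delta(A, B)$
equals $\log\left(\#(B \cap \bar{A})\,\#A\right)$ except when $A$ or $B$ is empty or $B \subseteq A$, in which case $\delta(A, B) = 0$.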

Remark 2. Note that $\delta$ is non-symmetric, i.e., $\delta(A, B)$ is not necessarily equal to
$\delta(B, A)$. Also, $\delta(A, B) = 0$ when $B \subseteq A$ (not only when $A = B$).

The definition of $\delta$ resembles in functional form the second log term in (5) and
may therefore be interpreted as conditional combinatorial entropy of $B$ given $A$.
From an information theoretical perspective ([20]) the value $\log \#(B \cap \bar{A})$
represents the additional description length (in bits) of an element in $B$ given a priori
knowledge of the set $A$. Hence we may view $A$ as a partial 'dictionary' while the part
of $B$ that is not included in $A$ takes an additional $\log \#(B \cap \bar{A})$ bits of description
given $A$.
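A minimal sketch of the function $\delta$ using Python sets may help to see the asymmetry and the zero cases. The example sets are hypothetical; `B - A` computes $B \cap \bar{A}$.

```python
from math import log2

def t(x):
    """t(x) = x if x > 1, otherwise 1."""
    return x if x > 1 else 1

def delta(A, B):
    """delta(A, B) = log2(t(#(B minus A) * #A))."""
    return log2(t(len(B - A) * len(A)))

A = {1, 2, 3, 4}
B = {3, 4, 5}

print(delta(A, B))       # log2(1 * 4) = 2.0
print(delta(B, A))       # log2(2 * 3), about 2.585
print(delta(A, {1, 2}))  # 0.0, since {1, 2} is a subset of A
print(delta(A, set()))   # 0.0, since the second set is empty
```

The first two lines of output differ, which shows that $\delta$ is not symmetric.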

The following set will serve as the underlying space on which we will consider 
our distance function. It is defined as 

$$2^{\Omega}_{+} = 2^{\Omega} \setminus \{A \subseteq \Omega : \#A \le 1\}.$$

It is the power set of $\Omega$ but without the empty set and singletons. We note that in
practice for most domains, as for instance the domain of binary strings considered 
later, the restriction to sets of size greater than 1 is minor. 

We have the following auxiliary lemma which will be useful in the proof of
Theorem 5.

Lemma 3. The function $\delta$ satisfies the triangle inequality on any three elements $A$,
$B$, $C \in 2^{\Omega}_{+}$ none of which is strictly contained in the other.

Proof. Suppose $A$, $B$, $C$ are any elements of $2^{\Omega}_{+}$ satisfying the given condition.
Without loss of generality we will show that

$$\delta(A, C) \le \delta(A, B) + \delta(B, C). \tag{6}$$

First we consider the specific case where the triplet has an identical pair. If $A = C$
then by Remark 2 it follows that $\delta(A, C) = 0$ which is a trivial lower bound so (6)
holds. If $A = B$ then $\delta(A, B) = 0$ and both sides of (6) are equal hence the inequality
holds (similarly for the case of $B = C$). Next we consider the more general case where
each of the following three quantities is at least 1,

$$\#(C \cap \bar{A}),\ \#(B \cap \bar{A}),\ \#(C \cap \bar{B}) \ge 1. \tag{7}$$

By definition of $2^{\Omega}_{+}$ we have $\#A \ge 2$ hence

$$\delta(A, C) = \log\left(t\left(\#(C \cap \bar{A})\,\#A\right)\right) = \log\left(\#(C \cap \bar{A})\,\#A\right) = \log \#(C \cap \bar{A}) + \log \#A.$$

Next, we claim that $C \cap \bar{A} \subseteq (B \cap \bar{A}) \cup (C \cap \bar{B})$. Suppose $x \in C \cap \bar{A}$ then $x \in C$ and
$x \notin A$. Now, either $x \in B$ or $x \in \bar{B}$. Suppose $x \in B$ then because $x \notin A$ it follows
that $x \in B \cap \bar{A}$. Suppose $x \in \bar{B}$ then because $x \in C$ it follows that $x \in C \cap \bar{B}$. This
proves the claim. Next, we have

$$\delta(A, B) + \delta(B, C) = \log \#A + \log \#(B \cap \bar{A}) + \log \#B + \log \#(C \cap \bar{B}).$$

It suffices to show that

$$\log \#(C \cap \bar{A}) \le \log \#(B \cap \bar{A}) + \log \#(C \cap \bar{B}) + \log \#B. \tag{8}$$

We claim that if three non-empty sets $X$, $Y$, $Z$ satisfy $X \subseteq Y \cup Z$ then $\log \#X \le
\log(2\,\#Y\,\#Z)$. To prove this, it suffices to show that $\#X \le 2\,\#Y\,\#Z$. From the
given, we have $\#X \le \#(Y \cup Z) \le \#Y + \#Z$ and the following inequality holds for
non-empty sets $Y$ and $Z$:

$$\frac{\#Y}{2\,\#Y - 1} \le 1 \le \#Z.$$

Hence $\#Y \le \#Z\,(2\,\#Y - 1)$. Therefore combining the above we have

$$\#X \le \#Y + \#Z \le 2\,\#Y\,\#Z$$

from which the claim follows. By (7), we may let $X = C \cap \bar{A}$, $Y = B \cap \bar{A}$ and
$Z = C \cap \bar{B}$ and from both of the claims it follows that

$$\#(C \cap \bar{A}) \le 2\,\#(B \cap \bar{A})\,\#(C \cap \bar{B}). \tag{9}$$

Taking the log on both sides of (9) and using the inequality $2 \le \#B$ (which follows
from $B \in 2^{\Omega}_{+}$) we obtain

$$\log \#(C \cap \bar{A}) \le 1 + \log \#(B \cap \bar{A}) + \log \#(C \cap \bar{B}) \le \log \#B + \log \#(B \cap \bar{A}) + \log \#(C \cap \bar{B}).$$

This proves (8) and hence (6). □
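The lemma can be checked by exhaustive enumeration on a small domain. The sketch below is only a numerical sanity check under an assumed domain of five elements, not a substitute for the proof. To avoid floating-point rounding it compares the integer arguments of the logarithms, since $\delta(A,C) \le \delta(A,B) + \delta(B,C)$ is equivalent to the corresponding inequality between products. The helper names are hypothetical.

```python
from itertools import combinations

def t(x):
    return x if x > 1 else 1

def v(A, B):
    """Integer argument of the logarithm in delta(A, B): t(#(B minus A) * #A)."""
    return t(len(B - A) * len(A))

def strictly_nested(A, B):
    """True if one of the two sets is a proper subset of the other."""
    return A < B or B < A

def triangle_holds(A, B, C):
    # Equivalent to delta(A, C) <= delta(A, B) + delta(B, C)
    return v(A, C) <= v(A, B) * v(B, C)

omega = range(5)  # assumed small domain for the check
# All subsets of size at least 2
sets = [frozenset(c) for r in range(2, 6) for c in combinations(omega, r)]

violations = [
    (A, B, C)
    for A in sets for B in sets for C in sets
    if not any(strictly_nested(P, Q) for P, Q in ((A, B), (A, C), (B, C)))
    and not triangle_holds(A, B, C)
]
print(len(violations))  # 0

# A nested triplet for which the inequality fails (B is strictly inside A)
A, B, C = frozenset({0, 1, 2, 3}), frozenset({0, 1}), frozenset({4, 5})
print(triangle_holds(A, B, C))  # False: 8 > 1 * 4
```

The last example illustrates why the lemma excludes triplets with strict containment: here $\delta(A, B) = 0$ while $\delta(A, C) = 3$ exceeds $\delta(B, C) = 2$.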
Next, we define the combinatorial information-distance. 

Definition 4. For any two non-empty sets $A$, $B$ define the combinatorial
information-distance as

$$d(A, B) = \max\{\delta(A, B), \delta(B, A)\}.$$

In the following result we show that $d$ satisfies the properties of a semi-metric.

Theorem 5. The distance function $d$ is a semi-metric on $2^{\Omega}_{+}$. It satisfies the triangle
inequality for any triplet $A, B, C \in 2^{\Omega}_{+}$ such that no element in the triplet is strictly
contained in another.

Proof. Clearly $d$ is symmetric as $d(A, B) = d(B, A)$. It is also clear that it is
non-negative. From Remark 2 it is clear that for $A = B$, $\delta(A, B) = \delta(B, A) = 0$ hence
$d(A, B) = 0$. For $A, B \in 2^{\Omega}_{+}$ such that $A \ne B$ (possibly $B \subset A$ or $A \subset B$) then
$\delta(A, B) > 0$ or $\delta(B, A) > 0$ hence $d(A, B) > 0$. Hence $d$ is a semi-metric on $2^{\Omega}_{+}$.

Next, we show that it satisfies the triangle inequality for any triplet $A, B, C \in 2^{\Omega}_{+}$
such that no element is strictly contained in another. For any non-negative numbers
$a_1, a_2, a_3, b_1, b_2, b_3$ that satisfy

$$a_1 \le a_2 + a_3$$

$$b_1 \le b_2 + b_3, \tag{10}$$

we have

$$\begin{aligned}
\max\{a_1, b_1\} &\le \max\{a_2 + a_3,\ b_2 + b_3\} \\
&\le \max\{\max\{a_2, b_2\} + \max\{a_3, b_3\},\ \max\{b_2, a_2\} + \max\{b_3, a_3\}\} \\
&= \max\{a_2, b_2\} + \max\{a_3, b_3\}.
\end{aligned}$$

From Lemma 3 it follows that (10) holds for the following: $a_1 = \delta(A, C)$, $b_1 =
\delta(C, A)$, $a_2 = \delta(A, B)$, $b_2 = \delta(B, A)$, $a_3 = \delta(B, C)$, $b_3 = \delta(C, B)$. This yields

$$d(A, C) \le d(A, B) + d(B, C)$$

hence $d$ satisfies the triangle inequality for such a triplet. □
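The properties stated in Theorem 5 can be checked numerically on a small domain. The following is a sketch under an assumed four-element domain; the function names are hypothetical, and a small tolerance is used in the triangle check because the values are floating-point logarithms.

```python
from itertools import combinations
from math import log2

def t(x):
    return x if x > 1 else 1

def delta(A, B):
    return log2(t(len(B - A) * len(A)))

def d(A, B):
    """Combinatorial information distance: max of the two one-sided values."""
    return max(delta(A, B), delta(B, A))

omega = range(4)  # assumed small domain for the check
sets = [frozenset(c) for r in range(2, 5) for c in combinations(omega, r)]

# Semi-metric properties on all pairs of sets of size at least 2
for A in sets:
    for B in sets:
        assert d(A, B) == d(B, A)
        assert d(A, B) >= 0
        assert (d(A, B) == 0) == (A == B)

# Triangle inequality on triplets without strict containment
for A in sets:
    for B in sets:
        for C in sets:
            pairs = ((A, B), (A, C), (B, C))
            if any(P < Q or Q < P for P, Q in pairs):
                continue  # skip triplets with strict containment
            assert d(A, C) <= d(A, B) + d(B, C) + 1e-9

print(d(frozenset({0, 1}), frozenset({1, 2, 3})))  # max(log2(2*2), log2(1*3)) = 2.0
```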

Let us now define the distance between two binary strings $x$ and $x'$. We take as $X$
the space of all finite binary strings and use the concepts and definitions introduced
in Section 2.

Definition 6. Let $R = X \times Y$ be all possible pairs $(x, y)$ and let $A \subseteq R$ be the
set of permissible pairs. For any $x \in \Pi_X(A)$ denote by $Y_x = \{y \in Y : (x, y) \in A\}$.
Let $x, x' \in X$ be two binary strings. Then the combinatorial information distance
between $x$ and $x'$ is defined as

$$d(x, x') = d(Y_x, Y_{x'})$$

where $d(Y_x, Y_{x'})$ is defined in Definition 4.

The next result follows directly from Theorem 5.

Corollary 7. Let $Y$ be a space of objects $y$ and $X$ the space of binary strings $x$
describing all sets $Y_x \subseteq Y$ that have cardinality at least 2. The combinatorial
information distance $d(x, x')$ is a semi-metric on $X$ and satisfies the triangle inequality
for triplets $x, x', x''$ whose sets $Y_x$, $Y_{x'}$, $Y_{x''}$ are not strictly contained one in another.

As an example, consider the mapping $M$ that takes binary strings to sets $Y_x$ in
$Y = \{0, 1\}^k$ (the $k$-cube) for some fixed finite $k$. This resembles the method of [20]
who break up a binary string $s$ into a set of substrings whose cardinality is taken
to be the complexity of $s$. Consider the following scheme for describing a set $Y_x$: we
form a string from the concatenation of all vertices on the $k$-cube that are elements
of the set $Y_x$. For instance, suppose $k = 5$ and $Y_x = \{01100, 10101, 11110\}$; then any
of the 6 possible strings that are formed by concatenating the three elements (we
call them $k$-words) of the set in any order represents a possible description of $Y_x$. If a
string contains repeated $k$-words then clearly only a single copy of each of these $k$-words will
be placed in $Y_x$. Note that the mapping $M$ that takes $x$ to $Y_x$ eliminates redundancy
in a way that is similar to the method of [20] which gives the minimal number of
copy operations needed to reproduce a string from a set of its substrings.
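The block-wise mapping described above can be sketched as follows. The function name `kwords`, the choice $k = 3$ and the example strings are hypothetical, and the sketch assumes that the length of the string is a multiple of $k$.

```python
def kwords(x: str, k: int) -> set:
    """Split x into consecutive non-overlapping k-words and return them as a set.
    Assumes len(x) is a multiple of k (an assumption of this sketch)."""
    if len(x) % k != 0:
        raise ValueError("length of x must be a multiple of k")
    return {x[i:i + k] for i in range(0, len(x), k)}

# Two different descriptions of the same (hypothetical) set of 3-words
x1 = "001" + "110" + "101"
x2 = "101" + "001" + "110" + "001"  # another order, with a repeated 3-word

print(kwords(x1, 3) == kwords(x2, 3))  # True
print(len(kwords(x2, 3)))              # 3
```

Both strings describe the same set, which shows the many-to-one nature of the mapping and how repeated $k$-words are collapsed into a single element.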

Another possible mapping $M$ may be defined by scanning a fixed window of
length $k$ across the string $x$ and collecting each substring (captured in the window)
as an element of the generated set $Y_x$. It requires some empirical analysis on real data
sets to determine the optimal value of the parameter $k$ that yields a good distance.
Yet another approach which does not need to choose $k$ is to use the method of [20]
and collect substrings of $x$ (of possibly different lengths) as the set $Y_x$.
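The sliding-window mapping and the resulting distance between two strings can be sketched as follows. The window is assumed to move one position at a time, and the function names, the value $k = 3$ and the example strings are hypothetical. Recall that the semi-metric properties are stated for sets of cardinality at least 2.

```python
from math import log2

def window_set(x: str, k: int) -> set:
    """All length-k substrings seen by sliding a window one position at a time."""
    return {x[i:i + k] for i in range(len(x) - k + 1)}

def t(v):
    return v if v > 1 else 1

def delta(A, B):
    return log2(t(len(B - A) * len(A)))

def string_distance(x: str, y: str, k: int) -> float:
    """d(x, x') = d(Y_x, Y_x') with the sets built by the sliding window."""
    Yx, Yy = window_set(x, k), window_set(y, k)
    return max(delta(Yx, Yy), delta(Yy, Yx))

x, y = "0101010101", "0101011111"
print(sorted(window_set(x, 3)))  # ['010', '101']
print(sorted(window_set(y, 3)))  # ['010', '011', '101', '111']
print(string_distance(x, y, 3))  # 2.0
print(string_distance(x, x, 3))  # 0.0
```

In this example the second string contributes two 3-words that are absent from the first, whose set has two elements, giving $\log_2(2 \cdot 2) = 2$ bits.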

Whichever mapping is used, to compute the combinatorial information distance
between any two finite strings $x$ and $x'$ first determine the sets of substrings for $x$
and $x'$ and let them be $Y_x$ and $Y_{x'}$ respectively. The distance $d(x, x')$ according to
Definition 6 is the set distance $d(Y_x, Y_{x'})$. Its properties are described in Corollary 7.